Rank distributions of words in additive many-step Markov chains and the Zipf law 
The binary many-step Markov chain with the step-like memory function is considered as a model 
for the analysis of rank distributions of words in stochastic symbolic dynamical systems. We prove 
that the envelope curve for this distribution obeys the power law with the exponent of the order of 
unity in the case of rather strong persistent correlations. The Zipf law is shown to be valid for the 
rank distribution of words with lengths about and shorter than the correlation length in the Markov 
sequence. A self-similarity in the rank distribution with respect to the decimation procedure is 
observed. 
The rank distributions (RD) in stochastic systems
attract the attention of specialists in physics and
many other fields of science because of their universal
power-law character (the so-called Zipf law (ZL) [1]).
Discovered originally for the RD of words in natural
languages, the ZL was later observed in the rank
distributions of other objects, such as "words" in
DNA sequences [?], PC codes [?], capitals of stock
market players [?] (in economics, the Zipf law in a
slightly different form is known as Pareto's principle
or the 80-20 rule [2]), the populations of cities, areas
occupied by countries, masses of cosmic objects, etc.
(see [5]). In spite of many attempts to describe this
phenomenon analytically, a deep insight into the problem
has not so far been gained.

To define the rank distribution of some objects in a
given sequence, it is necessary to establish a
correspondence between the objects and their frequencies
of occurrence in the sequence and to arrange the objects
in descending order of these frequencies, so that the
most frequent object has rank 1. The choice of a model
for the analytical description of the Zipf law in RD is
rather ambiguous because of the diversity of the systems
where it occurs [5]. Here the principal question concerns
the way of defining the objects that compete according
to the frequency of their occurrence in the sequence
under consideration. There exist two fundamentally
different approaches to this problem. The first of them
treats the objects as a priori equivalent, i.e. having
"the same rights" in the competition. The Zipf law in
the rank distributions of such models appears only due to
the correlations that are present in the sequence. The
rank distribution of the triplets in DNA sequences can
serve as a vivid example of a real system for which
this approach is essential (see [?]).
The second approach deals with sequences where the
correlations do not play an essential role. However,
the competing objects have a priori unequal chances to
take a given place in the sequence. Mandelbrot's models
[6], [7], the Kanter and Kessler model [8], and other
models are built on the choice of a priori nonequivalent
competitors, and it is specifically this non-equivalence
that makes the rank distributions obey the Zipf law.
For example, the non-equivalence of words in literary
texts is caused by their different lengths and,
consequently, by their different statistical weights,
which are defined by the number of characters in the
word and the size of the alphabet.

The nature of the real objects that satisfy the Zipf law
does not furnish sufficient arguments for giving
preference to one of the discussed approaches. It is
possible that the two approaches describe different
mechanisms of forming power-law rank distributions.

In the present paper, we suggest an analytically
solvable model of a many-step Markov chain in which the
rank distribution of different L-words (L consecutive
symbols in the chain) of a definite length L is examined.
Since the words are equal in length, the rank
distribution in this system arises as a result of the
correlations. In other words, our model is based
specifically on the first approach to the choice of the
competitors in the sequence. This study allowed us to
reveal the relation of the rank distributions to the
correlations existing in the system. Speculations about
the connection of the Zipf law to long-range
correlations were expressed clearly by a number of
authors [?]. We demonstrate that short-range
correlations can also give rise to the Zipf law.

We have analytically studied the rank distributions of
words of a certain length L in the Markov chains. If a
Markov chain possesses the one-step-like memory function
considered in Ref. [11], this distribution is shown to
be of the many-step-like form. In the case of strong
correlations, the envelope curve for the rank
distribution obeys a power law with an exponent of the
order of unity, i.e. the distribution is described by
the Zipf law. The results obtained provide sufficient
information to clarify the origin of Zipf's law. In
particular, we have verified that correlations of
symbols within the competing words are sufficient for
the Zipf law to appear in their rank distributions.

The suggested approach to the problem of the Zipf law
is expedient because it provides us with theoretical
parameters that control both the character of the
correlations and the rank distribution of words
occurring in the Markov chain. Owing to this, we could
examine the relationship between the rank distributions
and the correlation properties of the system.

Let us consider a homogeneous stationary unbiased
binary sequence of symbols, $a_i = \{0, 1\}$, and define
a word as a set of sequential symbols of definite length
L. Different words are obtained by progressively
shifting a window of length L by one symbol along the
sequence. The rank distribution of words is the
relationship connecting the probability W of a certain
word occurring to the corresponding rank. The words are
ordered in ascending rank order,
$W(1) > W(2) > \dots > W(2^L)$.
Our sequence is the N-step Markov chain with the
step-like memory function. This means that the
conditional probability
$P(a_i \mid a_{i-N}, a_{i-N+1}, \dots, a_{i-1})$
of a definite symbol $a_i$ occurring (for example,
$a_i = 0$) after the symbols
$a_{i-N}, a_{i-N+1}, \dots, a_{i-1}$
in the chain is determined by the equation



$$P(a_i = 0 \mid a_{i-N}, a_{i-N+1}, \dots, a_{i-1}) = \frac{1}{2} + \mu\left(1 - \frac{2k}{N}\right). \qquad (1)$$



Here k denotes the number of unities among the N symbols
$a_{i-1}, a_{i-2}, \dots, a_{i-N}$ preceding the generated
one, $a_i$, and $\mu$ is the strength of correlations in
the sequence, $-1/2 < \mu < 1/2$. The case $\mu = 0$
corresponds to the non-correlated random sequence of
symbols. Positive (negative) values of $\mu$ correspond
to persistent (anti-persistent) correlations, i.e. to the
attraction (repulsion) of symbols of the same kind.
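
As an illustration, the generation rule of Eq. (1) and the sliding-window ranking described above can be sketched in Python. This is a minimal sketch and not the procedure used for the simulations reported in this paper: the function names, the uncorrelated initial window and the fixed seed are assumptions introduced for the example.

```python
import random
from collections import Counter


def generate_chain(length, N, mu, seed=0):
    """Binary N-step Markov chain with the step-like memory function of Eq. (1).

    Assumption: the first N symbols are drawn uncorrelated; for long chains
    the influence of this initial window is expected to be negligible.
    """
    rng = random.Random(seed)
    chain = [rng.randint(0, 1) for _ in range(N)]
    k = sum(chain)  # number of unities among the last N symbols
    for i in range(N, length):
        p_zero = 0.5 + mu * (1.0 - 2.0 * k / N)  # Eq. (1)
        symbol = 0 if rng.random() < p_zero else 1
        chain.append(symbol)
        k += symbol - chain[i - N]  # slide the memory window by one symbol
    return chain[:length]


def rank_distribution(chain, L):
    """Frequencies of L-words (window shifted by one symbol), most frequent first."""
    counts = Counter(tuple(chain[i:i + L]) for i in range(len(chain) - L + 1))
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


if __name__ == "__main__":
    chain = generate_chain(200_000, N=15, mu=15 / 46)
    W = rank_distribution(chain, L=14)
    for rank in (1, 2, 3, 30, 31):
        print(rank, W[rank - 1])
```

The list returned by `rank_distribution` is indexed so that `W[R - 1]` is the estimated probability of the word of rank R; words that never occur in a finite chain are simply absent from the list.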

As was shown in Refs. [11], [12], the probability W of
a certain L-word occurring depends on the number k of
unities in the word with L < N but is independent of
their arrangement. It is described by the formula.
Since the probability $W(k)$ does not depend on the
arrangement of symbols within the L-word, a specific
degeneracy takes place in the Markov chain under study.
Another kind of degeneracy arises from the unbiased
character of the sequence: the probability $W(k)$ is
symmetric with respect to the change $k \to (L - k)$,
$W(k) = W(L - k)$. Thus, $2C_L^k = 2L!/[k!(L-k)!]$
different words occur with the same frequency $W(k)$.
This results in the step-like form of the rank
distribution of the L-words with L < N. Each step can be
labelled by the number k < L/2 of unities (or zeros)
within the words and is characterized by a length equal
to the degeneracy multiplicity $2C_L^k$. The right edge
of the kth step corresponds to the rank $R(k)$, which is
described by the equation



$$R(k) = 2\sum_{i=0}^{k} C_L^i. \qquad (4)$$
Indeed, performing the ranking procedure (all words
containing equal numbers of unities k have neighboring
ranks), we obtain this formula. The expression for
$W(k)$ together with Eq. (4), considered as a
parametrically defined function $W(R)$, represents the
envelope curve passing through the right edges of the
steps in the rank distribution.
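
The step structure can be checked by direct enumeration for small L. The sketch below reads Eq. (4) as $R(k) = 2\sum_{i=0}^{k} C_L^i$ and compares it with a brute-force count over all $2^L$ words. It assumes persistent correlations, so that the steps are ordered by increasing min(k, L - k); the function names are illustrative only.

```python
from collections import Counter
from itertools import product
from math import comb


def step_right_edge(L, k):
    """Right edge of the k-th step, read from Eq. (4) as 2 * sum_{i=0}^{k} C(L, i)."""
    if not 0 <= k < L / 2:
        raise ValueError("Eq. (4) is stated for 0 <= k < L/2")
    return 2 * sum(comb(L, i) for i in range(k + 1))


def step_edges_by_enumeration(L):
    """Group all 2^L words by min(k, L - k) and accumulate the group sizes.

    Assumption: words with a smaller min(k, L - k) are more probable,
    which is the ordering expected for persistent correlations.
    """
    sizes = Counter(min(sum(w), L - sum(w)) for w in product((0, 1), repeat=L))
    edges, total = {}, 0
    for k in sorted(sizes):
        total += sizes[k]
        edges[k] = total
    return edges


if __name__ == "__main__":
    L = 6
    edges = step_edges_by_enumeration(L)
    for k in range((L + 1) // 2):
        print(k, step_right_edge(L, k), edges[k])
```

For even L the central group k = L/2 contains $C_L^{L/2}$ words rather than twice that number, which is why Eq. (4) is applied here only for k < L/2.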

Using the Stirling formula for the Gamma functions
(which is valid at $L, k, (L - k) \gg 1$) and replacing
the summation in Eq. (4) by integration, one can
easily obtain the asymptotic expression for the
dependence $R(W)$,



$$R = 2^{\ldots}\,\ldots \qquad (5)$$

with

$$\zeta = \frac{1}{1 + 2n/L}, \qquad \ldots\,\Gamma(1/2 + n)\,\ldots \times \exp(-2n - L + 2). \qquad (6)$$



The distribution Eq. (5) differs from the usually
discussed power-law form by a logarithmic multiplier
only. If one neglects this weak logarithmic dependence,
the Zipf law for the rank distribution, a power law
$W \propto R^{-\zeta}$, is obtained from Eq. (5).
The result is illustrated in Fig. 1. The dotted line
shows the rank distribution obtained from the exact
expression for $W(k)$ at L = 14, N = 15, $\mu = 15/46$,
n = 4. This plot passes close to the solid line, which
shows the results of numerical simulations of the rank
distribution of words of length L = 14 in the Markov
chain generated with the same parameters N and $\mu$.
The dash-dotted line in this figure is the envelope
curve, Eq. (5).
FIG. 1: The rank distribution $W(R)$ of the words of length
L = 14 in the Markov chain with a step-like memory function,
N = 15, $\mu = 15/46$. The solid line corresponds to the
numerical simulations, the dotted line describes the exact
distribution $W(k)$, the dash-dotted line is the plot of the
envelope asymptotics Eq. (5), and the dash-dot-dotted line
describes the Zipf asymptotics Eq. (7) with $\zeta = 7/11$.
The thin solid line corresponds to anti-persistent correlations
obtained at $\mu = -15/46$, N = 15, L = 14.



in the invariance of the Zipf plot with respect to the
decimation procedure. According to the expression for
the exponent $\zeta$, the slope of the Zipf plot at a
given L depends on the parameter n only. This parameter
does not change under the transformation $N \to N^*$,
$\mu \to \mu^*$ (see the definition of n and Eq. (8)).
As a result, the Zipf plots for rank distributions of
L-words with L < N, obtained from the initial and
decimated sequences, coincide. This self-similarity
property is demonstrated in Fig. 2.
The exponent in Zipf's distribution Eq. (7) is governed
by the ratio n/L. In the case of weak correlations, at
$n \to \infty$, the value of $\zeta$ tends to zero and
the Zipf distribution appears to be destroyed, i.e. all
words of length L occur with almost the same
probability $W = 2^{-L}$. The opposite situation,
$n/L \to 0$, corresponds to strong correlations in the
Markov chain. Eq. (6) shows in this case that the
exponent $\zeta$ tends to unity. The plot of the Zipf
distribution Eq. (7) with $\zeta = 7/11$ is shown by
the dash-dot-dotted line in Fig. 1.
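
The quoted value of the exponent can be reproduced with exact arithmetic. The sketch below assumes that the exponent has the form $\zeta = 1/(1 + 2n/L)$, which is consistent with both limits described above and with the value $\zeta = 7/11$ quoted for L = 14, n = 4; the function name is illustrative.

```python
from fractions import Fraction


def zipf_exponent(n, L):
    """Assumed form of the Zipf exponent: zeta = 1 / (1 + 2n/L) = L / (L + 2n)."""
    n, L = Fraction(n), Fraction(L)
    return L / (L + 2 * n)


if __name__ == "__main__":
    print(zipf_exponent(4, 14))                        # parameters quoted for Fig. 1
    print(float(zipf_exponent(10**6, 14)))             # weak correlations, n large
    print(float(zipf_exponent(Fraction(1, 1000), 14))) # strong correlations, n/L small
```

With exact fractions the first call returns 7/11, while the other two calls illustrate the limits $\zeta \to 0$ and $\zeta \to 1$.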

According to statements given in the literature, the
Zipf law is associated with the property of scale
invariance [10], i.e. invariance of the slope of the
Zipf plot with respect to a certain decimation
procedure. An analogous property, referred to as
self-similarity, appears in the framework of the model
presented above. Let us reduce the N-step Markov
sequence by regularly (or randomly) removing some
symbols and introduce the decimation parameter
$\lambda < 1$, which represents the fraction of symbols
kept in the chain. As is shown in Ref. [?], the reduced
chain possesses the same statistical properties as the
initial one but is characterized by the renormalized
memory length, $N^*$, and the persistence parameter,
$\mu^*$,



$$N^* = N\lambda, \qquad \mu^* = \frac{\mu\lambda}{1 - 2\mu(1 - \lambda)}. \qquad (8)$$
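
The renormalization can be checked with exact fractions. The sketch below reads Eq. (8) as $N^* = N\lambda$, $\mu^* = \mu\lambda/[1 - 2\mu(1 - \lambda)]$ and additionally assumes the relation $n = N(1 - 2\mu)/(4\mu)$ for the parameter n. That relation is not stated in the text above; it is adopted here only because it reproduces the quoted n = 4 at N = 15, $\mu = 15/46$. The function names are illustrative.

```python
from fractions import Fraction


def decimate(N, mu, lam):
    """Renormalized memory length and persistence parameter, as read from Eq. (8)."""
    N_star = N * lam
    mu_star = mu * lam / (1 - 2 * mu * (1 - lam))
    return N_star, mu_star


def n_parameter(N, mu):
    """Assumed relation n = N (1 - 2 mu) / (4 mu); not taken from the text above."""
    return N * (1 - 2 * mu) / (4 * mu)


if __name__ == "__main__":
    N, mu = Fraction(32), Fraction(15, 46)
    for lam in (Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)):
        N_star, mu_star = decimate(N, mu, lam)
        # Under the assumed relation, n is the same before and after decimation.
        print(lam, N_star, mu_star, n_parameter(N, mu), n_parameter(N_star, mu_star))
```

Under these assumptions n is unchanged by the transformation for any $\lambda$, which is the algebraic content of the self-similarity of the Zipf slope.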



Indeed, the conditional probability $p_k^*$ of the
symbol zero occurring after k unities among the
preceding $N^*$ symbols in the reduced chain is described
by Eq. (1), where N and $\mu$ should be replaced by the
renormalized parameters $N^*$ and $\mu^*$. Considering
the Zipf law, we are interested



FIG. 2: Rank distributions of the L-words with L = 14 in
the N-step Markov sequences reduced by randomly removing
some symbols. The solid line corresponds to the initial
sequence with N = 32, n = 15. Symbols corresponding
to the decimation parameter $\lambda = 1/2$ ($N^* = 16$) lie
almost on the solid line. Other lines correspond to decimated
sequences with $N^* < L$. Specifically, the dash-dotted,
dash-dot-dotted, and dotted lines correspond to $N^* = 8$,
$N^* = 4$, and $N^* = 2$, respectively.

Now, let us study the rank distribution of L-words with
L > N. This problem is not amenable to analytic
calculations, and, therefore, numerical simulations are
applied. In this case, the above-mentioned degeneracy of
the word probabilities is absent, so the steps in the
rank distribution are smeared at L > N. This smearing
develops gradually with an increase of the word length L,
and the steps are completely smoothed out at high enough
values of L (see the curves in Fig. 3). This means that
the Zipf law then describes the rank distribution itself
rather than only its envelope curve, as at L < N.

It is important to draw attention to the non-monotonic
behavior of the Zipf slope $\zeta$ with an increase in
the word length L. As is seen from Eq. (6), this
parameter increases at L < N. This growth continues at
L > N as well, but only up to a certain value
$L = L_{cr} > N$, where the maximum of $\zeta$ is
observed. Then, at $L > L_{cr}$, $\zeta$ starts to
decrease. This decrease is demonstrated in the inset to
Fig. 3.
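
For L > N the slope has to be extracted from simulated rank distributions. The text does not specify how this was done; one simple possibility, shown here purely as an assumption, is an ordinary least-squares fit of ln W versus ln R. The function name is illustrative.

```python
import math


def zipf_slope(W):
    """Least-squares estimate of zeta in W ~ R^(-zeta) from a ranked list W.

    W[R - 1] is the probability of the word of rank R; zero entries are skipped.
    Assumption: a plain log-log fit over all ranks, with no weighting or cutoff.
    """
    pts = [(math.log(r), math.log(w)) for r, w in enumerate(W, start=1) if w > 0]
    m = len(pts)
    mean_x = sum(x for x, _ in pts) / m
    mean_y = sum(y for _, y in pts) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic check on an exact power law with exponent 0.8.
    W = [r ** -0.8 for r in range(1, 1001)]
    print(zipf_slope(W))
```

Repeating such a fit for a series of word lengths L would give an estimate of the dependence of $\zeta$ on L, although the result is sensitive to the range of ranks included in the fit.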

It is necessary to note that the position L = Lcr of 




FIG. 3: The Zipf plots for L-words with L > N. Solid,
dash-dotted, and dashed lines correspond to L = 12, 14, 18,
respectively. The parameters of the Markov chain are N = 12,
$\mu = 0.1$. The phenomenon of step smearing is observed with a
growth of the word length L. In the inset: the Zipf plots for
L-words with N = 4, $\mu = 0.1$ at $L > L_{cr}$ (the lengths of
words are shown near the curves).



the maximum in the $\zeta(L)$ dependence is strongly related
to the characteristic correlation length $2l_c + N$ in the
Markov chain being studied. According to Ref. [12], the
symbols correlate with each other not only within the
memory length N but within the enlarged region
$2l_c + N$, where $l_c$ represents the characteristic
attenuation length of the fluctuations. Thus, the best
fit of the rank distribution of words in the Markov chain
by a power-law curve is achieved if the size of the
competing words is close to the correlation length.

If the words are shorter than the correlation length,
$L < N + 2l_c$, the specific features of the correlations
become apparent in the rank distribution, which results
in deviations from the Zipf law. In the system
considered in this paper, the deviations manifest
themselves in the appearance of steps in the rank
distribution and in the additional weak logarithmic
multiplier in Eq. (5). Moreover, at very small
$N \sim 1$ the rank distribution deviates significantly
from the power law and takes an exponential shape at
N = 1.

In the opposite limiting case, at $L \gg N + 2l_c$, the
correlations over the whole word length disappear and
the rank distribution tends to a constant. It is
important to note that specifically persistent
correlations in the Markov chain (which correspond to the
attraction between the same symbols) lead to the
pronounced Zipf law in the rank distribution of words.
Indeed, the thin solid line in Fig. 1 demonstrates the
very weak $W(R)$ dependence for the case of
anti-persistent correlations, at $\mu = -15/46$.

Thus, the results obtained allow us to suggest the
following physical picture of the appearance of the Zipf
law. Correlations should be present in the system, but
noise of sufficient strength should be imposed on these
correlations over the length of the competing words.
Within the stochastic system considered, this noise is
provided by sufficiently strong fluctuations observed on
the scale of the word length. The role of the noise
consists in concealing the specific peculiarities of the
correlations. Owing to the fluctuations, the specific
shape of the correlations is not very important.
Accordingly, the Zipf law is a consequence of rather
rapidly decaying persistent correlations of quite
arbitrary form, i.e. global correlations in the system
are not necessary. The Zipf law is a manifestation of
the inner microstructure of the system, being a result
of attraction between building blocks of the same kind.